Provenance for Generalized Map and Reduce Workflows
Authors
Abstract
We consider a class of workflows, which we call generalized map and reduce workflows (GMRWs), where input data sets are processed by an acyclic graph of map and reduce functions to produce output results. We show how data provenance (also sometimes called lineage) can be captured for map and reduce functions transparently. The captured provenance can then be used to support backward tracing (finding the input subsets that contributed to a given output element) and forward tracing (determining which output elements were derived from a particular input element). We provide formal underpinnings for provenance in GMRWs, and we identify properties that are guaranteed to hold when provenance is applied recursively. We have built a prototype system that supports provenance capture and tracing as an extension to Hadoop. Our system uses a wrapper-based approach, requiring little if any user intervention in most cases, and retaining Hadoop’s parallel execution and fault tolerance. Performance numbers from our system are reported.
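The backward and forward tracing described in the abstract can be illustrated with a small in-memory sketch. This is not the prototype's code and does not use any Hadoop API. All function names and element ids are hypothetical. It assumes that the provenance of a map output is the single input element it was produced from, and that the provenance of a reduce output is the set of input elements sharing its key. The wrappers record provenance without changing the user's map and reduce functions.

```python
from collections import defaultdict

def run_map_with_provenance(map_fn, inputs):
    """Apply map_fn to each (id, value) and record which input produced each output."""
    outputs, prov = [], {}
    for in_id, value in inputs:
        for n, kv in enumerate(map_fn(value)):
            out_id = f"m:{in_id}:{n}"        # hypothetical id scheme
            outputs.append((out_id, kv))
            prov[out_id] = {in_id}           # assumption: one input per map output
    return outputs, prov

def run_reduce_with_provenance(reduce_fn, inputs):
    """Group (id, (key, value)) by key, reduce each group, record the group's ids."""
    groups = defaultdict(list)
    for in_id, (key, value) in inputs:
        groups[key].append((in_id, value))
    outputs, prov = [], {}
    for key in sorted(groups):
        members = groups[key]
        out_id = f"r:{key}"
        outputs.append((out_id, (key, reduce_fn(key, [v for _, v in members]))))
        prov[out_id] = {i for i, _ in members}   # assumption: whole key group
    return outputs, prov

def backward_trace(element_id, stages):
    """Follow provenance from an output element back to the original inputs."""
    frontier = {element_id}
    for prov in reversed(stages):
        frontier = set().union(*(prov[e] for e in frontier))
    return frontier

def forward_trace(element_id, stages):
    """Find every final output element derived from a given input element."""
    frontier = {element_id}
    for prov in stages:
        frontier = {out for out, ins in prov.items() if ins & frontier}
    return frontier

# Word count as a one-map, one-reduce workflow over two toy documents.
docs = [("d1", "a b"), ("d2", "b c")]
map_out, map_prov = run_map_with_provenance(
    lambda text: [(w, 1) for w in text.split()], docs)
reduce_out, reduce_prov = run_reduce_with_provenance(
    lambda key, values: sum(values), map_out)
stages = [map_prov, reduce_prov]

print(backward_trace("r:b", stages))  # inputs that contributed to the count of "b"
print(forward_trace("d2", stages))    # outputs derived from document d2
```

Tracing is applied one stage at a time, so a longer acyclic workflow would extend the `stages` list rather than change the tracing functions. A real system would store provenance alongside the data instead of in Python dictionaries.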
Related resources
RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
RAMP (Reduce And Map Provenance) is an extension to Hadoop that supports provenance capture and tracing for workflows of MapReduce jobs. RAMP uses a wrapper-based approach, requiring little if any user intervention in most cases, while retaining Hadoop’s parallel execution and fault tolerance. We demonstrate RAMP on a real-world MapReduce workflow generated from a Pig script that performs senti...
A Provenance-Integration Framework for Distributed Workflows in Grid Environments
Provenance information about complex and distributed workflows is a key issue for data quality control and data reliability maintenance in reservoir management. Distributed and integrated environments where different workflows consume and transform data require a comprehensive provenance view. In this scenario provenance collection and integration presents significant challenges. In this paper,...
Understanding Collaborative Studies through Interoperable Workflow Provenance
The provenance of a data product contains information about how the product was derived, and is crucial for enabling scientists to easily understand, reproduce, and verify scientific results. Currently, most provenance models are designed to capture the provenance related to a single run, and mostly executed by a single user. However, a scientific discovery is often the result of methodical exe...
Provenance trails in the Wings/Pegasus system
Our research focuses on creating and executing large-scale scientific workflows that often involve thousands of computations over distributed, shared resources. We describe an approach to workflow creation and refinement that uses semantic representations to 1) describe complex scientific applications in a data-independent manner, 2) automatically generate workflows of computations for given da...
HadoopProv: Towards Provenance as a First Class Citizen in MapReduce
We introduce HadoopProv, a modified version of Hadoop that implements provenance capture and analysis in MapReduce jobs. It is designed to minimise provenance capture overheads by (i) treating provenance tracking in Map and Reduce phases separately, and (ii) deferring construction of the provenance graph to the query stage. Provenance graphs are later joined on matching intermediate keys of the...